Automating the Surveillance of Mosquito Vectors from Trapped Specimens Using Computer Vision Techniques
Among all animals, mosquitoes are responsible for the most deaths worldwide.
Interestingly, not all types of mosquitoes spread diseases, but rather, a
select few alone are competent enough to do so. In the case of any disease
outbreak, an important first step is surveillance of vectors (i.e., those
mosquitoes capable of spreading diseases). To do this today, public health
workers lay several mosquito traps in the area of interest. Hundreds of
mosquitoes will get trapped. Naturally, among these hundreds, taxonomists have
to identify only the vectors to gauge their density. This process today is
manual, requires complex expertise/training, and is based on visual inspection
of each trapped specimen under a microscope. It is long, stressful and
self-limiting. This paper presents an innovative solution to this problem. Our
technique assumes the presence of an embedded camera (similar to those in
smart-phones) that can take pictures of trapped mosquitoes. The techniques
proposed here then process these images to automatically classify the
genus and species type. Our CNN model based on Inception-ResNet V2 and Transfer
Learning yielded an overall accuracy of 80% in classifying mosquitoes when
trained on 25,867 images of 250 trapped mosquito vector specimens captured via
many smart-phone cameras. In particular, the accuracy of our model in
classifying Aedes aegypti and Anopheles stephensi mosquitoes (both of which are
deadly vectors) is amongst the highest. We present important lessons learned
and the practical impact of our techniques towards the end of the paper.
Evaluating Multiway Multilingual NMT in the Turkic Languages
Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets, and finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets as well as models to the public. Peer reviewed
A Large-Scale Study of Machine Translation in Turkic Languages
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains, and iv) human evaluation scores. All models, scripts, and data will be released to the public. Peer reviewed
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. We manually
audit the quality of 205 language-specific corpora released with five major
public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: At least 15 corpora have no usable text, and a
significant fraction contains less than 50% sentences of acceptable quality. In
addition, many are mislabeled or use nonstandard/ambiguous language codes. We
demonstrate that these issues are easy to detect even for non-proficient
speakers, and supplement the human audit with automatic analyses. Finally, we
recommend techniques to evaluate and improve multilingual corpora and discuss
potential risks that come with low-quality data releases. Comment: Accepted at TACL; pre-MIT Press publication version
Turkic Interlingua: A Case Study of Machine Translation in Low-resource Languages
Machine Translation (MT) has the potential to bridge the gap between the developed world and marginalized communities by making information more accessible in real time. While there are over 7000 spoken languages in the world, only about a hundred have access to high-quality MT systems and even fewer enjoy the benefits of more advanced language technologies. Unfortunately, resource scarcity and the lack of digital infrastructure are only some of the many challenges associated with globalizing NLP. Many large-scale multilingual studies and datasets get little to no feedback from native speakers or linguistic experts of the languages involved, leading to serious problems of data quality and potential biases. In this thesis, we present a case study of participatory research in 22 Turkic languages involving native speakers, language technologists, researchers, linguists, commercial entities, and more. Through this thesis, we compile and release the largest public corpus for MT in Turkic languages along with 26 bilingual baseline models. We outline the curation and release of public datasets, the development of machine translation technologies, and their deployment in real-world scenarios. In addition, we discuss the lessons learned through this case study, its applications, and limitations, as well as implications for future projects.
Leveraging smart-phone cameras and image processing techniques to classify mosquito genus and species
Identifying insect species integrates image processing, feature selection, unsupervised clustering, and a support vector machine (SVM) learning algorithm for classification. Results with a total of 101 mosquito specimens spread across nine different vector-carrying species demonstrate high accuracy in species identification. When implemented as a smart-phone application, the latency and energy consumption were minimal. The currently manual process of species identification and recording can be sped up, while also minimizing the ensuing cognitive workload of personnel. Citizens at large can use the system in their own homes for self-awareness and share insect identification data with public health agencies.